TAS: KEP update for API #3237
Conversation
Force-pushed from 98a9426 to e1a1325.
Groups []TopologyAssignmentGroup `json:"groups"`
Domains []TopologyDomainAssignment `json:"domains"`

// levels is an ordered list of keys denoting the levels of the assigned
What is the purpose of this field?
TopologyUngater needs the information about the keys to construct the full node selector when ungating pods.
This information is also present in the Topology object, but I think it is useful to also put it here.
Otherwise we would need to complicate the Ungater code to look up the Topology object via the ResourceFlavor, and deal with possible deletions / changes of the API objects in the meanwhile.
The list has at most 5 elements (levels), so the Workload object size should not be an issue.
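For context, a rough sketch of the two types discussed in this thread, assuming the field shapes shown in the diffs (comments paraphrased, not the final KEP text):

```go
// TopologyAssignment describes the topology assignment computed at admission.
type TopologyAssignment struct {
	// levels is an ordered list of node label keys denoting the levels
	// of the assigned topology (e.g. block, rack).
	Levels []string `json:"levels"`

	// domains is the list of assigned topology domains.
	Domains []TopologyDomainAssignment `json:"domains"`
}

// TopologyDomainAssignment identifies a single assigned domain.
type TopologyDomainAssignment struct {
	// values is an ordered list of node selector values describing a
	// topology domain; they correspond one-to-one to the levels keys.
	Values []string `json:"values"`
}
```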
Ok, put this field above domains.
Done
// +kubebuilder:validation:MinItems=1
NodeLabels map[string]string `json:"nodeLabels"`
type TopologyDomainAssignment struct {
// values is an ordered list of node selector values describing a topology
It would be clearer if you duplicated this description in TopologyAssignment.
Sure, I've updated the TopologyAssignment description to include the information on how the node selector is specified. PTAL.
Force-pushed from e1a1325 to 8980d61.
//
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=5
Let's be a bit more generous here and bump it to 8, so that it covers all possible use cases, and at the same time has a sane max limit.
sure, done
Force-pushed from 8980d61 to 9a36b3a.
/lgtm
LGTM label has been added. Git tree hash: cb7590a3326b99750053131557778fc63df34b2f
Force-pushed from 9a36b3a to 5950e78.
/lgtm
LGTM label has been added. Git tree hash: b6669556f015d2be83940c4a9d0a8277537c2614
// +kubebuilder:validation:MinItems=1
NodeLabels map[string]string `json:"nodeLabels"`
// +kubebuilder:validation:MaxItems=8
Values []string `json:"values"`
I think the `values` name is slightly generic, and it's hard to recognize the objective of this field since the `topologyDomainAssignment` does not have the context for the nodeSelector.
Will we receive requests other than nodeSelector keys here?
> I think the `values` name is slightly generic, and it's hard to recognize the objective of this field since the `topologyDomainAssignment` does not have the context for the nodeSelector.

I agree, this is why I mention the node selector in the comment. We could name the field `nodeSelectorValues`, but it seems verbose.

> Will we receive requests other than nodeSelector keys here?

The node selectors are created by combining only the keys (the `.levels` field) and values (the `.domains.values` field).
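To illustrate, a minimal sketch of that combination, assuming the field shapes above (`buildNodeSelector` is a hypothetical helper, not Kueue code):

```go
// buildNodeSelector zips the assignment's level keys with one domain's
// values to produce the node selector applied when ungating a pod.
func buildNodeSelector(levels []string, domain TopologyDomainAssignment) map[string]string {
	nodeSelector := make(map[string]string, len(levels))
	for i, key := range levels {
		nodeSelector[key] = domain.Values[i]
	}
	return nodeSelector
}
```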
> Will we receive requests other than nodeSelector keys here?
>
> The node selectors are created by combining only the keys (the `.levels` field) and values (the `.domains.values` field).

Sorry, I meant the nodeSelector values.

> I agree, this is why I mention the node selector in the comment. We could name the field `nodeSelectorValues`, but it seems verbose.

In that case, how about `levelValues`, since these values are for the levels?
> In that case, how about `levelValues`, since these values are for the levels?

As a name it sgtm, but maybe the only potential downside is gRPC size if we have many domains assigned? I guess for big workloads it could be 10k domains or more (if the lowest level is an individual node). WDYT?
I guess that `levelValues` does not significantly impact the gRPC message size, since the longer field name adds just ~0.05 MiB in the 10k-domains situation (10k domains × ~5 extra bytes per field name).
Additionally, I guess that TAS will be used for the Top of Rack (ToR), Leaf, or Spine switches instead of Nodes, since the topology often needs to be considered based on the network topology, right?
Let me know if you have actual or imaginable situations where we need to construct the topology based on the Nodes.
> Additionally, I guess that TAS will be used for the Top of Rack (ToR), Leaf, or Spine switches instead of Nodes, since the topology often needs to be considered based on the network topology, right?

Later maybe yes; for now we just want to ensure the pods are landing on closely connected nodes.
I think there could be use cases to assign at the level of nodes to ensure there is no issue with fragmentation of quota. For example, if you assign at the level of a rack, it could happen that a Pod can fit in the rack, but cannot fit on an individual node. I guess it will need to depend on the user preference.
> I guess that `levelValues` does not significantly impact the gRPC message size, since the longer field name adds just ~0.05 MiB in the 10k-domains situation.

Yes, it is not much, but the overall max API object size in etcd is 1.5Mi, which is not much either, so the gain is around 3% of the max size.
> For example, if you assign at the level of a rack, it could happen that a Pod can fit in the rack, but cannot fit on an individual node. I guess it will need to depend on the user preference.

I feel that this usage goes beyond the initial TAS motivations. But I know that we received similar requests here: #3211

> Yes, it is not much, but the overall max API object size in etcd is 1.5Mi, which is not much either, so the gain is around 3% of the max size.

Yes, that's true. Here, the key problem is which disadvantages we should accept. I think that both choices have the disadvantages below:

- `values`: This lacks the context and background that this field holds the values for the nodeSelector keys defined in the levels. Only after reading the API documentation can we understand the purpose of this field.
- `levelValues`: This will increase the Workload object size. This may cause etcd performance issues.

Based on the fact that the Workload object is for Kueue developers and not exposed to the batch users, I am leaning toward accepting `values`. In that case, could you mention this `values` vs `levelValues` discussion in the Alternatives? I guess that we can revisit this alternative based on user feedback, e.g. if batch admins often struggle because of this field name.
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
Groups []TopologyAssignmentGroup `json:"groups"`
// +kubebuilder:validation:MaxItems=8
Levels []string `json:"levels"`
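As a concrete illustration of what these fields hold together, a hypothetical populated assignment (label keys and values are invented for the example):

```go
// A two-level (block > rack) assignment spanning two racks in one block.
assignment := TopologyAssignment{
	Levels: []string{"example.com/topology-block", "example.com/topology-rack"},
	Domains: []TopologyDomainAssignment{
		{Values: []string{"block-1", "rack-1"}},
		{Values: []string{"block-1", "rack-2"}},
	},
}
```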
Which objective does this field serve: storing desired state, or a temporary cache for the computation by the TopologyUngater?
I guess that this is a temporary cache for the computation by the TopologyUngater.
In that case, could we put this information in an internal cache instead of the CRD?
It is the desired state to be reached by the Pods. We cannot use internal state, because then it would be lost on Kueue restart.
Does that mean that the TopologyUngater cannot obtain the levels from the Topology CR when Kueue restarts?
Oh, I see what you mean. A similar question was asked in this comment.
It makes the implementation of the TopologyUngater simpler (otherwise it would require lookups via the RF API and Topology API, and dealing with changes to the objects), at a very small cost in terms of object size, as the number of levels is quite limited.
Also, I think it is conceptually consistent with what we do here, as we store the mapping between resources and flavors, rather than reading resources from the CQ API object.
> otherwise it would require lookups via the RF API and Topology API, and dealing with changes to the objects

I think that this could be resolved by a dedicated internal cache storing the relationships between RF and Topology.

> at a very small cost in terms of object size, as the number of levels is quite limited

Yes, I agree with you. My doubt was that the field is just a cache, since we can obtain it from other objects.

> Also, I think it is conceptually consistent with what we do here, as we store the mapping between resources and flavors, rather than reading resources from the CQ API object.

Uhm, I see. Indeed the levels (new field) and flavors (existing field) are stored in the Workload status, and it seems that this reduces the multiple API calls (RF -> Topology).
When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actual assigned flavor and topology information in a same-level field.
Anyway, I'm ok with adding the levels here. Thanks.
> I think that this could be resolved by a dedicated internal cache storing the relationships between RF and Topology.

Sure, it looks possible, I get the point, but I think it would complicate the implementation to maintain the cache, rather than reading from the object.

> When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actual assigned flavor and topology information in a same-level field.

Sure, I'm ok to add it to the KEP to re-evaluate this decision.

> Anyway, I'm ok with adding the levels here. Thanks.

Thanks!
> When we graduate the API version to v1beta2, we may want to restructure the Workload status field so that we can store the actual assigned flavor and topology information in a same-level field.
>
> Sure, I'm ok to add it to the KEP to re-evaluate this decision.

Let me know if you want that to be added, though I think it is unlikely to be dropped. For example, if we read from the cache, the keys could be changed in the meanwhile, and then the domain assignment would be corrupted. If we know the keys at the moment of assignment, we could compare them with the ones in the cache (maintained based on the Topology API) and evict the workload.
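A minimal sketch of that comparison (hypothetical helper; the eviction mechanics are elided):

```go
// levelsStillValid compares the level keys stored in the Workload at
// admission time with the current keys from the Topology-backed cache;
// on mismatch, the workload should be evicted rather than ungated with
// a corrupted domain assignment.
func levelsStillValid(stored, current []string) bool {
	if len(stored) != len(current) {
		return false
	}
	for i := range stored {
		if stored[i] != current[i] {
			return false
		}
	}
	return true
}
```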
> Sure, it looks possible, I get the point, but I think it would complicate the implementation to maintain the cache, rather than reading from the object.

Yeah, that's true. I think that this is a trade-off between implementation costs and API maintenance costs, since the dedicated cache could avoid exposing the API and easily breaking the structure, but it has implementation costs.
Anyway, I do not claim a dedicated cache now.

> For example, if we read from the cache, the keys could be changed in the meanwhile, and then the domain assignment would be corrupted.

This could be prevented by event-based cache updating, same as in today's cache package (/pkg/cache).
I'm wondering if we can put the dedicated cache mechanism, instead of the status field, in the Alternatives section.
But this is nonblocking for merging this PR.
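For reference, such an event-updated dedicated cache might look roughly like this (a hypothetical sketch, not the /pkg/cache implementation):

```go
import "sync"

// topologyCache is a hypothetical map from ResourceFlavor name to the
// levels of its Topology, kept current by watch/event handlers.
type topologyCache struct {
	mu             sync.RWMutex
	levelsByFlavor map[string][]string
}

// onTopologyUpdate would be wired to Topology add/update events.
func (c *topologyCache) onTopologyUpdate(flavor string, levels []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.levelsByFlavor[flavor] = levels
}

// levelsFor returns the cached levels for a flavor, if known.
func (c *topologyCache) levelsFor(flavor string) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	levels, ok := c.levelsByFlavor[flavor]
	return levels, ok
}
```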
I can update the alternative section in a follow up
Yeah, sure.
Force-pushed from 5950e78 to 864e36a.
Thank you for the update!
There are no blockers, and some comments can be addressed in follow-ups.
/lgtm
/approve
LGTM label has been added. Git tree hash: db598e6a06d554912fabf11c8e6fba23db305b46
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mimowo, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
What type of PR is this?
/kind cleanup

What this PR does / why we need it:
To reflect API decision changes during implementation:
- use `domain` rather than `group` consistently

Which issue(s) this PR fixes:
Part of #2724

Special notes for your reviewer:

Does this PR introduce a user-facing change?